Techniques to produce and evaluate realistic multivariate synthetic data

Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate normal characteristic; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically like their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X. Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data was generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples. A class of samples has a latent normal characteristic. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.

www.nature.com/scientificreports/ roughly d 2 parameters, and deep neural network architectures have even greater number of parameters, requiring large sample sizes for a given design 15 (see related table in 15 for examples). These modeling techniques illustrate that an adequate sample size under one condition may not be optimal for another. It is our premise that the adequate sample size for a given multivariate prediction problem that allows independent validation deserves more attention beyond larger sample sizes are better, especially when normality assumptions do not hold. By hypothesis, a technique that can generate realistic synthetic data will provide benefits in modeling endeavors. Such an approach could be used to augment an inadequate sample size for modeling and validation purposes or to study sample size requirements for a given multivariate covariance structure. Synthetic data applications in health-related research use a variety of techniques. Some methods are used for generating samples from large populations [16][17][18][19] . These approaches include hidden Markov models 18 , techniques that reconstruct time series data coupled with sampling the empirical probability density function of the relevant variables 16 , and methods that estimate probability density functions (pdfs) from the data, not accounting for variable correlation 19 . Other work used moment matching to generate synthetic data but does not consider relative frequencies in the comparison analysis 20 . Discussions on synthetic data generation techniques indicate that the small sample size condition has received little analytical attention 20,21 .
Our synthetic data technique estimates a multivariate pdf for arbitrarily distributed data, including when normal approximations fail to hold 22 . This initial work, based on multivariate kernel density estimation (mKDE) with unconstrained bandwidths, was illustrated with d = 5 data from mammographic case-control data 22 . Synthetic populations (SPs) of arbitrarily large size were generated from samples of limited size. However, mKDE has noted efficiency problems for high dimensionality 23,24 . As the dimensionality increases, the sample size requirement apparently becomes exceeding large, noting this area is under investigation. Although categorizing a problem as high or low dimensionality may be dependent on many factors, reasonable arguments suggest that high dimensionality may be defined as 3 < d ≤ 50, where d is the number of variables considered 25 ; in that, density estimators should be able to address this range 25 . Here we let d = 10 so that many of the findings can be presented graphically or reasonably tabulated, and modeling problems in healthcare research can be within this range.
In this report, we present modifications to our method to mitigate the mKDE efficiency problem under specific conditions (latent normality) and address synthetic data generation in relatively higher dimensionality (d = 10). This modified approach decomposes an arbitrarily distributed multivariate problem into multiple univariate KDE (uKDE) problems while characterizing the covariance structure independently 26 . We are evaluating whether this approach can transform an arbitrarily distributed multivariate sample into an approximate multivariate normal form, which we define as a sample with a latent normal characteristic. In the universe of arbitrarily distributed samples, there is a multivariate normal subgroup. The technique for generating synthetic data for this normal subgroup is relatively straightforward and well-practiced. By hypothesis, our approach seeks to extend these straightforward techniques to the latent normal class by determining when (or if) it exists. Developing the analytics to detect this condition and then leveraging it to generate synthetic data are essential elements of our work 26 .

Methods
Overview. Our modified synthetic data generation and analytic techniques have sequential components and many related analyses. Therefore, clear definitions, preliminaries, and a brief outline are given before the details are provided. Justifications are also discussed here when warranted.
Definitions. Population is used to define a hypothetical collection of virtually unlimited number of either real or synthetic entities from which samples comprised of observations or realizations may exist or can be drawn. The exception for the use of population is when explaining differential evolution (DE) optimization 27 used for uKDE bandwidth determination. The DE-population is limited and defined specifically. Sample defines a collection of n real observations with d attributes (variables) from the space of possible samples, represented mathematically as a n × d matrix (rows = observations, columns = attributes). Column vectors are designated with lower-case bold letters. For example, individual attributes are referred to as x, a column vector. Vector components are designated with lower-case subscripted letters. The components of x are referenced as x j for j = 1, 2, …, d and assumed continuous. The multivariate pdf for x is p(x). X refers to the input variable space, that is, x exists in the X representation. We assume p(x) exists at the population level, but not accessible. In practice, we evaluated normalized histograms throughout this work for all variables considered both univariate and multivariate (i.e., empirical pdfs), also referred to as pdfs for brevity; we use this term to refer to attributes at the population level as well as at the sample level. One-dimensional (1D) marginal pdfs for p(x) are expressed as p j (x j ). Matrices are designated with upper case bold lettering. For example, X is the n × d matrix with n observations of x in its rows (i.e., the ith row contains the d attributes for the ith observation and the jth column of x has n realizations of x j ). Double subscripts are used for both specific realizations and matrix element indices. That is, x ij is the jth component for the ith realization in X (also is the indexing for X). Variables in X are mapped to the Y representation. This creates the corresponding entities in Y: (1) the vector y with d components; (2) the multivariate pdf, g(y), and its marginal pdfs, g j (y j ); and (3) the matrix Y defined analogously as X. We also work in the T representation (uncorrelated variables) as explained below, where t, t j , and T are defined similarly. Likewise, r(t) is the multivariate pdf in T with marginals, r j (t j ). We define the cumulative probability functions (i.e., the indefinite integral approximation of a given univariate pdf) for p j (x j ) and g j (y j ) as P j (x j ) and G j (y j ), respectively. Covariance quantities are calculated with the normal multivariate form: E(w − m w )(v − m v ), where E is the expectation operator, w and v are arbitrary random variables with means m w and m v . The corresponding covariance matrices are expressed as C x , C y , and C t , respectively (or C k generically). When an entity is given the subscript, s, it then defines the corresponding synthetic entity. Standardized normal defines a zero mean-unit variance normal pdf, www.nature.com/scientificreports/ used in both the univariate and multivariate scenarios. Parametric for this report refers to functions that can be expressed in closed form.
Preliminaries. General biomedical healthcare data characteristics are discussed to overview some of their characteristics. Measurements such as body mass index (BMI) and age, or measurements taken from image data can  have right-skewed pdfs because they are often positive-valued and not inclusive of zero (see Figs. 1, 2 and 3 for  examples). Such measures can bear varying levels of correlation (see lower parts of Tables 1, 2 and 3). Thus, an arbitrary p(x) may not lend itself to parametric modeling in X (i.e., normality and non-correlation assumptions do not apply in many instances). To render such data into a more tractable form, a series of steps (see Fig. 4) were taken to condition X; by premise, these steps will permit characterizing the sample with parametric means and then generating similar multivariate synthetic data without mKDE.
Outline of the processing steps. When describing these steps, we also briefly discuss the analysis at a given step (also provided in detail in the methods). The process starts with a given sample in X (Fig. 4, top left). The processing flow for the sample (X-Y-T) is illustrated in the top row of Fig. 4, and the reversed SP generation flow (T s -Y s -X s ) in the bottom row.
Step 1 Univariate maps were constructed (Fig. 4, top-left) to transform a given X measurement to a standardized normal, producing the respective marginal pdf set in Y (Fig. 4, top-middle). Maps were constructed with Figure 1. Marginal Probability Density Functions (pdfs) for Sample 1 (DS1) in the X representation: each pdf for DS1 (solid) is compared with its corresponding pdf from synthetic data (dashes). The x-axis cites the variable name from its respective resource and its index name parenthetically (x j ). www.nature.com/scientificreports/ an augmented sample size using optimized uKDE, addressing the small sample size problem. uKDE was used to generate synthetic x j . Here, we augmented the sample size with the goal of filling gaps in the input marginal pdfs of x j (sample) to complement the map constructions. By hypothesis, this step addresses the small sample size problem by guaranteeing continuous smooth maps that will produce standardized normal pdfs from the sample. Synthetic x j generated in this fashion do not maintain the covariance relationships in X and were not used further; only x j from the sample were mapped to Y, and KDE was not used beyond this point. There is no guarantee that a set of normal marginals in Y will produce a multivariate normal pdf. Although the reverse is always true because a multivariate normal has univariate normal marginals. In practice, g(y) from the sample could be assessed at this point to estimate how well it approximates normality. If the latent normal approximation is poor, another synthetic approach could be pursued, or the process could be discontinued. Here we forgo such testing at this step (normality was tested for in steps 3 and 4 instead) and move through all steps to illustrate the techniques. We will also discuss a possible modification that could be investigated when the sample has a poor latent normal characteristic approximation, later in the discussion.
Step 2 Principal component analysis (PCA) was used to decouple the variables in Y producing uncorrelated variables in T (Fig. 4, top-right).
Step 3 Synthetic data was generated in T (Fig. 4, bottom-left) as uncorrelated random variables. Here we assumed that each marginal in T from the sample could be approximated as normal with variance given by the jth eigenvalue of C y . To generate synthetic data, the columns of T s were populated as normally distributed random variables with these specified variances (noting, the columns lack correlation). We refer to the realizations in T as the SP (i.e., T s ), noting the column length (number of synthetic entities) can be arbitrarily large.

Figure 2.
Marginal Probability Density Functions (pdfs) for Sample 2 (DS2) in the X representation: each pdf for DS2 (solid) is compared with its corresponding pdf from synthetic data (dashes). The x-axis cites the variable name from its respective resource and its index name parenthetically (x j ). www.nature.com/scientificreports/ To address the normal characteristic, univariate marginal and multivariate pdfs from the sample were tested for normality at this point.
Step 4 The inverse PCA transform (Fig. 4, bottom-middle) of T s produced the SP in Y (Y s ), thereby restoring the covariance relationships that were removed in Step 2. We note, this technique (steps 3 and 4) of producing multivariate normal data is a practiced approach when the sample is multivariate normal or well approximated as such. For reference, the multivariate standardized normal in Y is expressed as where C y can be approximated as the covariance matrix from a given sample. If synthetic samples are poor replicas of their sample, it follows the sample's g(y) will be a poor approximation of multivariate normality (i.e., the sample is not in the latent normal class).
We evaluated how well the inverse PCA transformation preserved the covariance (C y ) and the method's capability of restoring the univariate/multivariate pdfs in Y (i.e., normality comparisons) rather than in Step 2.
Step 5 Each synthetic variable in Y is inverse mapped to X (Fig. 4, bottom left). This reversed Step 1, thereby producing the SP in X (X s ), and restoring the covariance relationships by hypothesis. The respective pdfs and covariance matrices in X were compared with those from synthetic samples; pdfs were also tested for normality.
It is important to clarify a few aspects of this work. The univariate mapping from X to Y creates a set of univariate marginals normally distributed that can produce a multivariate normal, but not guaranteed. The in the X representation: each pdf for DS3 (solid) is compared with its corresponding pdf from synthetic data (dashes). The x-axis cites the variable name from its respective dataset and its index name parenthetically (x j ). www.nature.com/scientificreports/ comparison of univariate marginal pdfs, however, is no guarantee that the respective multivariate pdfs are reasonable facsimiles because the covariance structure has been removed. Many univariate pdf comparisons are provided between samples and synthetic samples in addition to multivariate comparisons, because these allow visualizing similarities with the above stipulations. When a given sample has the latent normal characteristic, the SP generation is greatly simplified, and then it is defined by Eq. (1). The work below shows how to generate synthetic data when this characteristic holds. We use three datasets that were selected pseudo-randomly. In the space of samples (virtually unlimited), we do not know the probability that a sample selected at random will have this latent normal characteristic. The main objectives are to present the analysis components with the methods for testing for the latent characteristic, give a thorough investigation, demonstrate that the synthesis produces realistic data when this latent condition exists, and then discuss further analyses.
Study data. Samples were derived from two sources of measurements: (1) mammograms and related clinical data (n = 667), and (2) dried beans (n = 13,611) 28 . Most technical aspects of these data are not relevant for this report. Mammography data included all observations with mammograms from a specific imaging technology, thereby defining n = 667. We used the dried bean data to add variation to the analyses as the variable nomenclatures are very different from the mammogram data, noting at this point the source of data is not germane. From mammograms, we considered two sets of measurements each with d = 10 variables referred to as Sample 1 (DS1) and Sample 2 (DS2). DS1 has 8 double precision measurements from the Fourier power spectrum in addition to age and BMI, both captured as integer variables. The Fourier attributes are from a set of measurements described previously 29 ; the first 8 measurements from this set are labeled as P i for i = 1-8. These Fourier measures are consecutive and follow an approximate functional form 30 , and thus represent variables that are very different than those in DS2 (or Sample 3 below). To cite the covariance quantities and correlation coefficients, we used a modified covariance (covariance for short) table format for efficiency because C k is symmetric. In these tables, entries below the diagonal are the respective correlation coefficients, whereas the elements along the diagonal (variance quantities) and above are the covariance quantities. The covariance table for DS1 is shown in Table 1a. DS2 contains 8 double precision summary measurements derived from the image domain: mean, standard deviation (SD); SD of a high-pass (HP) filter output, SD of a low-pass (LP) filter output; local SD summarized; P 20 Fourier measure (from the set described for DS1 measurements); local spatial correlation summarized 31 ; and breast (Br) area measured in cm 2 . Age and BMI (from DS1) were also included in this dataset. These variables were selected virtually at random to give d = 10 and possibly provide a different covariance structure than DS1. The covariance table is shown in Table 2a. Neither DS1 nor DS2 were used in our related-prior mKDE synthetic data work. Selected measures and realizations from the dried bean dataset 28 are referred to as Sample 3 (DS3). The bean data has 17 measurements (floating point) from 7 bean types. We selected 10 measures at random to make the dimensionality of DS3 compatible with the other two datasets giving this set of variables: area (1); minor axis (5); eccentricity (6); convex area (7); equivalent diameter (8); extent (9); solidarity (10); roundness (11); shape factor 3 (15), and shape factor 4 (16). Here, parenthetical references give the variable number listed in the respective resource (see 28 ). Both bean type (bean type = Sira, with n = 2636) and n = 667 observations were selected at random to create DS3. Keeping n = 667 constant across datasets permits consistent statistical comparisons. For example, confidence intervals (CIs) and other comparison metrics are dependent upon the number of observations. The covariance table is shown in Table 3a. The analysis of three samples supports the evaluation of the processing scheme under generalized scenarios. The means and standard deviations for x j in each dataset are provided in Table 4. Note, the dynamic range of the means and standard deviations within a given sample vary widely in some instances.
KDE, optimization, and mapping. The mapping (Step 1) relies on generating synthetic x j with uKDE.
Each bandwidth parameter was determined with an optimization process wherein synthetic data from uKDE was compared with the sample. There is a continued feedback loop between the sample, synthetic data generation, and comparison during the optimization process. When the optimization was completed, a given map was constructed. b P 1 (y 1 ) P 2 (y 2 ) P 3 (y 3 ) P 4 (y 4 ) P 5 (y 5 ) P 6 (y 6 ) P 7 (y 7 ) P 8 (y 8 Table 1. Covariance and correlation for Dataset 1: in both tables, entries on the diagonals and above give covariance quantities. Entries below the diagonals (bold) provide the respective Pearson correlation coefficients. Table 1a gives the X representation quantities and 1b the Y representation quantities. The covariance quantities were generated from the respective sample. Parenthetically, 95% confidence intervals generated from synthetic samples are cited below the respective covariance quantity. Variables are cited with the names used in their respective resource and with the names used in this report parenthetically. Step 1 and as a modification to our previous work 22 , uKDE was used to generate realizations from each p j (x j ) given by where x j is the synthetic variable for this discussion, x ij are observations from a given sample, k j is a normalization factor, and h j is the univariate bandwidth parameter.
Optimization. Differential evolution optimization 27 was used to determine each h j in Eq. (2). The population of candidate h j evolves over generations to a solution, as described in detail previously 22 . To form generation zero for a given x j , the DE-population (n = 1000) was initialized randomly (uniformly distributed) within these bounds: (0.0, 4 × the variance of x j ) because a given range should span the solution for h j . The DE-population size stays constant across all generations. By expectation, the populations become more fit according to the fitness function over generations. A given generation is found by comparing two-DE populations (1000 pair-wise competitions) of candidate h j solutions derived from the previous generation. For a given pairwise competition, two respective SPs were generated for each candidate h j . Synthetic samples (n = 667) were drawn from each SP and P j (x j ) from the sample was compared with each P j (x j ) derived from its synthetic sample using Eq. (2). The h j candidate [used in Eq. (2)] that produced a smaller D j was used to populate the current generation, where D j is the difference metric derived from the fitness function. This process was repeated for 30 generations. In summary, 30 × 2 × 1000 synthetic samples were compared with the sample via P j (x j ) comparisons for a given x j to derive its respective h j used in Eq. (2). A given X to Y map was then constructed with x j sampled from Eq. (2) with h j = E[terminal population of candidate h j ]. As a modification, the Kolmogorov Smirnov (KS) test 32 was used as the fitness function in the DE optimization. This is a nonparametric test that can be used to compare two numerically derived cumulative probability functions or compare a numerically derived curve to a reference 32 . Here we compare the respective numerical univariate cumulative probability functions derived from synthetic samples with those derived from the sample. The difference metric, D j , for the KS test is the absolute maximum difference between the two cumulative probability functions under comparison.
Mapping. For each map in Step 1, we solve P j (x j ) = G j (y j ) numerically with interpolation methods described previously 33,34 , where P j (x j ) and G j (y j ) are assumed to be monotonically increasing. This solves the random variable transformation for each y j analogous to histogram matching with double precision accuracy. Synthetic y j (n = 10 6 ) were generated as standardized normal random variables using the Box-Muller (BM) method. Maps from X to Y are expressed as y j = m j (x j ), where m j is the jth map. The corresponding inverse maps, x j = m −1 j (y j ) , were derived numerically by inverting a given map and solving for x j . The map construction was complemented by generating synthetic x j with Eq. (2) using h j derived from DE optimization. Synthetic x j generated here were not used further.
Synthetic population generation. Synthetic populations (SPs) were generated in the uncorrelated T representation and converted back to X via Y. In Step 2, the PCA transform for the sample is given by where P is a d × d matrix with uncorrelated normalized columns. These are the normalized eigenvectors of C y that capture the sample's covariance structure. C t is diagonal with: c jj = σ 2 j (t) corresponding to the ordered eigenvalues of C y . We make the approximation that r(t) from the sample has the multivariate normal form expressed as When the multivariate normality approximation holds in Y, it should hold in T. In Step 3, synthetic t j were populated as zero mean independent normally distributed random variables with variances = σ 2 j (t) using the BM method, producing the SP in T (T s ). The row length of T s defines the number of realizations in each SP and is

Mean (y 1 ) SD (y 2 ) HP filter (y 3 ) LP filter (y 4 ) Local SD (y 5 ) P 20 (y 6 )
Local  Table 2. Covariance and correlation for Dataset 2: in both upper and lower tables, entries on the diagonals and above give covariance quantities. Entries below the diagonals (bold) provide the respective Pearson correlation coefficients. Table 2a gives the X representation quantities and 2b the Y representation quantities. The covariance quantities were generated from the respective sample. Parenthetically, 95% confidence intervals generated from synthetic samples are cited below the respective covariance quantity. Variables are cited with the names used in their respective resource and with the names used in this report parenthetically. With this process, g(y) = g n (y) for synthetic data. By premise, the covariance of Y s should be like that of Y. In Step 5 to produce the SP in X (X s ), synthetic y j were inverse mapped. Similarly, the covariance of X should be like that of X s . An example is also provided to illustrate that X s is densely populated in contrast with the sparse sample in X.
Statistical methods. The goals are to evaluate the latent normal characteristic and to produce synthetic data that is statistically like its sample. This analysis is based on both multiple univariate/multivariate pdf and covariance comparisons. A given synthetic sample (n = 667) was drawn at random from its SP. The same realizations from a given synthetic sample were used for comparisons in X, Y, and T, when applicable.
Probability density function comparisons. Univariate pdf comparisons. The KS test (described in Step 1) was used for all univariate pdf comparisons. For such comparisons, we selected the test threshold at the 5% significance level as the critical value. In X, p j (x j ) from the sample were tested for normality and compared with their respective pdfs from synthetic samples created in Step 5. In Y, g j (y j ) from the sample were compared with the respective pdfs from their synthetic samples produced by Step 4; this implicitly evaluated univariate normality in Y because synthetic y j were derived from a multivariate normal process in T s . In T, we compared r j (t j ) from the sample with their respective pdfs derived from zero mean normally distributed random variables (i.e., synthetic t j ) with variances = σ 2 j (t) [from Step 3]. For each t j , the sample was compared with 1000 synthetic samples, and the percentage of times that measured D j was less than the critical test value was tabulated.
Distribution free multivariate pdf comparisons. To evaluate whether the sample and synthetic samples were drawn from the same distribution without assumptions, we used the maximum mean discrepancy (MMD) 35 test. This is a kernel-based (normal kernel) analysis that computes the difference between every possible vector combination between and within two samples (excluding same vector comparisons). To determine the kernel parameter for these tests, we used the median heuristic 35,36 . This analysis is based on the critical value (MMD c ) at the 5% significance level and the test statistic (MMD 2 u ) . Both quantities are calculated from the two samples under comparison. This test has an acceptance region given the two distributions are the same: MMD 2 u < MMD c (see Theorem 10 and Corollary in 35 ). This test was applied in X, Y, and T. Note, when applying this test in either T or Y, it is implicitly testing the sample's likeness with multivariate normality. In X, Y, and T, 1000 synthetic samples were compared with the sample. The test acceptance percentage was tabulated. MMD 2 u and MMD c values are provided as averages over all trials because they change per comparison.
Random projection multivariate normality evaluation. Random projections were used to develop a test for normality. The vector w with d components is multivariate normal if the scalar random variable, z = u T w, is univariate normal, where u is a d component vector with unit norm that is defined as a projection vector in this report 37,38 . As mentioned by Zhou and Saho 38 , we developed this formulism into a specific random projection test. To actualize such a test to probe the samples and synthetic samples similarity with normality, the projection vector u was generated randomly 1000 times, referenced as u s . Here, s is the projection index ranging from [1,1000]. The projection equation is then expressed as where z|s defines the scalar z conditioned upon s. In Eq. (6), x, y, or t was substituted for w, and given projection was taken over all realizations (i.e., n = 667) of given sample. These realizations of z|s were used to form the normalized histogram that approximates the conditional pdf for the left side of Eq. (6) defined as f(z|s). A (5) Y s = T s P T .

Shape factor 3 (y 9 )
Shape factor 4 (y 10 )  Table 3. Covariance and correlation for Dataset 3: in both tables, entries on the diagonals and above give covariance quantities. Entries below the diagonals (bold) provide the respective Pearson correlation coefficients. Table 3a gives the X representation quantities and 3b the Y representation quantities. The covariance quantities were generated from the respective sample. Parenthetically, 95% confidence intervals generated from synthetic samples are cited below the respective covariance quantity. Variables are cited with the names used in their respective resource and with the names used in this report parenthetically. www.nature.com/scientificreports/ different series of u s was produced for each representation; once a given series was produced, u s remains fixed. The components of u s were generated as standardized normal random variables, where u s was normalized to unit norm. For a given sample, f(z|s) was tested for normality using the KS test. This procedure was repeated for all random projections (all s), resulting in 1000 KS test comparisons for normality. The percentage of the times that the null hypothesis was not rejected was tabulated as the normality similarity gauge. We refer to this procedure as the random projection test. It is the percentage of times that w was not rejected when probed in 1000 random directions. This test was performed once for each sample and with 100 synthetic samples and averaged. Here, the same u s series used to probe a given sample was also used to probe 100 of its respective synthetic samples. We note, synthetic samples are multivariate normal in Y and T by their construction. Tests were performed on synthetic samples (Y and T) to give control standards as normal comparators. Tests were also performed in X: (1) as control comparator for the test itself; and (2) to determine if a given sample was multivariate normal before undergoing the mapping.
To gain both insight into Eq. (6) and the test, expressions for both z|s and f(z|s) are developed. First, z|s results from a linear random variable transform given by where u k are the components of u s , and h k (w k ) are the univariate pdfs for scaled w k . To check one endpoint, we assume w k are independent as a coarse approximation to our samples. Then, f(z|s) results from repeated convolutions given by where c −1 = u 1 × u 2 × u 3 × u 4 × … × u d , and h k (w k ) ~ h k . If some h k (w k ) have relatively much larger variances (widths) than others, their functional forms can tend to predominate Eq. (8).
Mardia multivariate normality test. This is a two component test that uses multivariate skewness and kurtosis for evaluating deviations from normality 37 , applied in X, Y and T. It produces a deviation measure for each component as well as a combined measure; we cite the component-findings. This test was applied in X as a control. Outlier elimination techniques were not applied.
Covariance comparisons. Two methods were used to evaluate the covariance similarity between samples and synthetic samples: with (CIs) and eigenvalue comparisons.
Comparisons with confidence intervals. Each covariance matrix element between the sample and its respective synthetic samples was compared with CIs. We assumed the sample and synthetic samples were drawn from the same distributions. We used the elements from each C x and C y as point estimates from a given sample in both X and Y. One thousand synthetic samples (n = 667) were used to calculate 1000 covariance matrices (in X and Y). For each matrix element, the respective univariate pdf was formed, and 95% CIs were calculated. This procedure was repeated 1000 times. The percentage of times the sample's point estimate (for each element in C y and C x ) was within the synthetic element's CIs was tabulated.
Comparisons with PCA. The eigenvalues from C y were used as the reference comparators under two conditions. For condition 1, the PCA transform determined with the sample was applied to a synthetic sample (sample/syn test). Synthetic eigenvalues were estimated by calculating the variances of synthetic t j . For condition (7) z|s = u 1 w 1 + u 2 w 2 + u k w k + · · · + u d w d ,

Results
Univariate normality analysis in the X representation. Figures 1, 2 and 3 show the univariate pdfs (solid) for each sample in X. Of note, many pdfs are observably non-normal, usually right skewed. Each x j in DS1 showed significant deviation from normality (p < 0.0001) except for x 9 . In DS2, neither x 1 or x 9 (x 9 is the same in DS1) showed significant deviation from normality, while the remaining x j exhibited significant deviations (p < 0.0003) except for x 3 (p = 0.0144). In DS3, x 1 through x 5 and x 9 did not show significant deviations from normality, the remaining x j deviated significantly (p < 0.002).
Mapping and KDE optimization. Mapping. Figure 5 shows an example of the X to Y map for y 9 in the left-pane and the inverse Y to X map in the right-pane. Red-dashed lines show the map and its inverse constructed without synthetic x j . Staircasing effects are observable particularly in the tail regions, where sample densities are sparse. Black lines show the map and its inverse constructed with n = 667 (the sample) plus n = 10 6 synthetic realizations produced with optimized uKDE. Staircasing effects were removed when incorporating synthetic x j , which was common with all maps and inverses (not shown).
uKDE optimization. The optimization produced bandwidth parameter solutions (h j ) used in Eq. (2). Here, we illustrate the evolution of the solution with h 9 from DS1 and DS2 as an example. Figure 6 shows the scatter plot between the candidate h 9 population and the respective D 9 (KS test difference metric) for DE generation = 1 in the left-pane and for the terminal generation = 30 in the middle-pane. The solution space (middle-pane) is tightly clustered indicating DE convergence. A closer view of this cluster is shown in the right-pane of Fig. 6. This relatively tight-cluster characteristic was common among all variables and datasets (not shown).

Univariate comparisons between samples and synthetic samples. Comparisons in T.
These findings are summarized first because they start the flow back to X and can show departures from normality. Figures 7, 8 and 9 show the pdfs for the samples (solid) compared with their corresponding synthetic pdfs (dashed). Table 5 shows the variances in T for each sample (i.e., eigenvalues for each sample). Due to (1) the normalization in Y, and (2) that d = 10, multiplying a given, σ 2 j (t) by 10% gives the percentage of the total variance explained by its t j . Table 6 shows the KS test findings for the univariate normality comparisons. Here we use a cutoff of < 65% to indicate deviation as most trends were well above this boundary. As shown in Table 6 (left column for each dataset): (1) the normal model did not deviate for any t j in DS1 (7 t j were < 94%); (2) the normal model deviated for t 10 in DS2 (5 t j were < 94%); and (3) the normal model deviated for t 7 , t 8 , t 9 , and t 10 in DS3 (3 t j were < 94%). In DS2, t 10 explains about 0.2% of the total variance. Similarly, in DS3, the sum of the variances of the four variables (t 7 , t 8 , t 9 , and t 10 ) constituted about 0.14% of the total variance.  Figures 10, 11 and 12 show the pdfs in Y resulting from the mapped samples (i.e., mapped x j ) for each dataset (solid) compared with their respective synthetic pdfs (dashed), which are normal by construct. Comparisons in Y showed little departure from normality in any sample, as the tests were not rejected (about 99%) in most instances (Table 6, middle column for each dataset). Findings from DS2 and DS3 indicate that substituting normal pdfs in T whenever sample t j deviated from normality had little influence on this analysis. This may be because these respective variables in total or isolation explained a minute portion of the variance in the respective PCA models.
Comparisons in X. Figures 1, 2 and 3 show the pdfs in X for the samples (solid) compared with their corresponding synthetic pdfs (dashed). The pdfs from the sample did not deviate from their corresponding synthetic pdfs in any dataset, as the tests were not rejected (< 99%) in most instances (Table 6, right columns). The parenthetical entries in Table 6 show the test findings without using synthetic data for the map/inverse constructions (using the samples only). These show that complementing the map constructions with uKDE is a necessary component of this methodology, although the degree of deviation from the KS tests varied across datasets. Note, the improvement held in D3 as well, which was normal in X (shown below).

Multivariate comparisons and normality comparisons. MMD tests.
Testing was performed in X, Y, and T, and the test metrics are provided in Table 7. These show the samples and respective synthetic samples were drawn from the same distributions. In 100% the tests, measured MMD 2 u values were less than the critical MMD c quantities. The MMD tests in Y and T were also proxy tests for sample-normality due to the SP constructs.
Random projection normality tests. Testing was performed in X, Y, and T, and the findings are shown in Table 8. Test findings were mixed for the samples in X and were not rejected for approximately these instances: 40% in DS1; 74% in DS2; and 99% in DS3. Thus, DS3 is better approximated as normal in X compared to the other samples. The tests for synthetic samples in X tracked the findings for their respective samples: 44%, 74% and 99% respectively. In Y, the tests for the samples were not rejected for about these instances: 99% in DS1, 97% in DS2, and 95% in DS3, whereas the test for the synthetic samples should no deviation from normality. Similarly in T, the tests for the samples were not rejected for about these instances: 99% in DS1, 94% in DS2, and 95% in DS3. There is a difference in the X and Y analyses because the mapping normalizes the variables in Y. In DS3, the standard deviations vary over many orders of magnitude (see Table 4). As shown by Eq. (8), variables in DS3 with the larger standard deviations may wash out the other variables; the variables in X that had normal marginals compared with those that were not, indicates that a portion of the normal marginals had much larger standard deviations (in Table 4, see x 1 -x 4 ). As another control experiment, we standardized all variables in X to zero mean and unit variance and performed the tests again. The tests for the samples gave: 32.9%, 76.4%, and 75.2% for DS1, DS2, and DS3, respectively. For synthetic samples, these tests gave: 33.6%, 80.1% and 89.1% for DS1, DS2, and DS3, respectively. Note, centering the means alone had no influence on the findings as expected (data not shown). Thus, normalizing the univariate measures can influence the likeliness with normality by virtue of Eq. (8). In sum, these tests show all samples resemble multivariate normality in both Y and T and that the sample for DS3 resembles normality in X without mapping (without first normalizing the variances).

Mardia normality tests.
Testing was performed in X, Y, and T. The findings are shown in Table 8. In X, the samples and synthetic samples all deviated from normality (both skewness and kurtosis). In both Y and T, the samples showed significant deviations from normally in all tests. In contrast, synthetic samples did not deviate significantly from normality in any test in Y or T, as expected.
Covariance comparisons. Covariance matrix comparisons with confidence intervals. Test findings are provided in Tables 1, 2 and 3 for the respective datasets. Part-a of each table shows the X quantities, and part-b shows the corresponding Y quantities. For DS1 (Table 1), covariance references (sample) were within the CIs of the synthetic data for 100% of the trials in both X and Y. For DS2 (Table 2), most references agreed with the synthetic elements except for two entries in X (Table 2a). From the 1000 trials, the x 2 x 3 covariance was out of tolerance for 0.1% of the instances, and the x 5 x 10 covariance was out for 18.7% of instances. For DS3, all covariance references were within tolerance except the x 7 x 10 covariance, which was out of tolerance for 100% of the instances. In tests that showed more deviation (percentage > 0.1%), the reference covariances were approximately zero. Table 5. This table is separated into three sections vertically. Reference eigenvalues are provided in the top row of each section. Eigenvalues calculated from the sample/syn (condition 1) and syn/sample (condition 2) are provided in the middle and bottom rows of each section, respectively. F-tests were not significant (p > 0.05) in any comparison with the references indicating similarity.

Eigenvalue comparison tests. Eigenvalues are provided in
Sample sparsity and synthetic population space filling. This illustration demonstrates that the approach fills in the multidimensional space with synthetic realizations derived from a relatively sparse sample. We selected a synthetic entity at random from DS1 giving this vector: x T = [4.20, 2.08, 1.61, 1.15, 0.85, 0.67, 0.54, 0.44, 52.0, 23.6]. We selected x 1 and x 8 as the scatter plot variables. For the other 8 components, all synthetic realizations within x ij ± ½ σ j (the standard deviation for x j ) were selected and viewed in the x 1 x 8 plane as a scatter plot. The same vector and limits were used to select realizations from the sample. The plots are provided in Fig. 13 for comparison. The sample (left-pane) produced n = 24 realizations, whereas the SP (right-pane) pro- www.nature.com/scientificreports/ duced n = 36,398 realizations. These plots illustrate the sample's relative sparsity and that the synthetic approach produces a dense population with observations that did not exist in the sample.

Discussion
The work involved several steps to generate synthetic data from arbitrarily distributed samples. To the best of our knowledge, new aspects and findings from this work include: (1) demonstrating a class of arbitrarily distributed samples has a latent normal characteristic, as exhibited by two of the samples; (2) conditioning the input variables with sample size augmentation and then constructing univariate transforms so that known techniques could be applied to generate synthetic data; (3) deploying multiple statistical tests for assessing both normality and general similarity in both the univariate and multivariate pdf settings; (4) developing methods for comparing covariance matrices; and (5) incorporating differential evolution (DE) optimization for uKDE bandwidth determination based on the KS fitness function. The related findings are discussed below in detail.
A method was presented that converts a given multivariate sample into multiple 1D marginal pdfs by constructing maps. These X-Y maps were constructed by augmenting the sample size with optimized uKDE. Performing the analysis with and without data augmentation improved the marginal pdf comparisons between the samples and synthetic samples; this also held in DS3 (four x j ), which was approximately normal in X. PCA applied to standardized normal variables in Y produced uncorrelated variables in T, where synthetic data was generated. This approach essentially decouples the problem into the covariance relationships (in P and its inverse) and 1D marginal pdfs (i.e., approximate parametric models in Y and T). This decoupling is similar to the objective of Copula modeling that follows from Sklar's work 39,40 . Copula modeling allows specification of marginal pdfs and the correlation structure independently 41 ; in this approach, the marginals must be specified accurately and finding analytical solutions for d > 4 is difficult 42 . In contrast with Copula modeling, which is flexible, the covariance (or correlation) structure in our approach is fixed by the normal form and empirically derived; the marginals were forced to normality rather than specified. As a benefit, the CIs for C k with our approach were estimated from the pdfs for each matrix element without assumption other than the normal calculation form. Additionally, the eigenvalue comparison technique results reinforced the CI comparison findings. Outside of the multivariate normal situation it is not clear when (1) comparing the marginals with one set of tests, and (2) comparing the covariance relationships separately with another set of tests results in a good overall empirical comparison-approximation between two multivariate samples. Such situations will require further analyses.
There are several other points worth noting about this work. An empirically driven stochastic optimization technique was used to estimate the uKDE bandwidth parameters for the map/inverse constructions. The relative efficiency of the approach is an important attribute in that it only requires multiple uKDE applications rather than mKDE. The number of generations in the optimization was fixed. This can be changed easily to a variable termination based on achieving a critical threshold or applying other appropriate fitness functions; for example, the stopping criteria could be based on the critical distance in the KS test or the change in this distance from one generation to the next. Likewise, there are plug-in kernel bandwidth parameters that can be used statically. These are derived by considering closed form expressions containing the constituent pdfs and minimizing the asymptotic behavior of either the mean integrated square error or mean squared error 43 . We explored such parameters 44 , but they did not perform as well as the KS test with DE, notwithstanding the number of computations used here to determine a given bandwidth parameter. Of note, the KS test has limitations, as it is more sensitive to the median of the distribution rather than the tails. As an alternative, the Anderson Darling test is a variant of the KS procedure that is sensitive to the distribution tails 32 . The mapping from X to Y standardized the problem at the univariate level, but in general there is no guarantee that collectively it produced a multivariate normal in Y. Testing performed in Y (Step 2) could be used to discriminate input samples that have the latent normal characteristic from those that do not. The random project test could be developed into a gauge at this step for assessing the deviation from normality. Moreover, comparing r(t j ) from the sample against normality (following Step 3) also provides a basis for testing sample's likeness to a multivariate normal (discussed below). When p(x) is approximately multivariate normal, as in DS3, the mapping is not required and generating synthetic data based on PCA (without the mapping step) is a practiced technique; our approach addresses the case when this approximation fails to hold. The random projection tests changed the similarity with normality when standardizing the samples. Thus, the purpose for generating synthetic data should be considered before adjusting the input sample.
There are several other limitations and qualifications worth noting. Several multivariate pdf tests were examined with mixed findings. MMD tests in X, Y, and T showed each sample was statistically similar with its respective synthetic sample(s). These tests also indicated normality in Y and T (by default). This MMD test is sensitive to changes in the mean. In our processing, all means were forced either to identically zero or to statistical similarity via mapping. Likewise, the heuristic used for the kernel bandwidth determination can be less than optimal under certain conditions, decreasing the MMD test performance 45 . Random projection tests in Y and T indicated that the samples did not deviate from normality in most instances, whereas synthetic samples showed essentially no deviation. Understanding the acceptable departure from normality for this test in the modeling context will require more work. This test also showed DS3 was approximately normal in X. In contrast, Mardia tests showed all samples deviated significantly from normality in X, Y and T. With the Mardia test, synthetic samples showed: (1) essentially no deviation in Y or T as expected; (2) and significant deviation in X. Here, we made no attempt to mitigate possible outlier interference when analyzing the samples 46 . Note, testing for multivariate normality is not a trivial task; many of the complexities are covered by Farrell et al. 47 The conclusions we make from these tests indicated each sample was approximately multivariate normal in Y and T, noting the approach may not be dependent upon this characteristic as elaborated below. In planned research, these approximations will be tested in the modeling context to evaluate whether sample and synthetic www.nature.com/scientificreports/ data are interchangeable. When this normality approximation holds, it implies that the original multivariate pdf estimation problem in X was converted to a parametric normal model described by Eq. (1), which simplifies the synthetic data generation. If this conversion generalizes to other datasets (at least in part), it implies that some class (subset) of the multivariate sample space can be studied with simulations by altering, n, d, and the covariance matrix to that of an arbitrary sample. Future work involves investigating arbitrary selected samples to understand how often this latent normal characteristic is present. Alternatively, analyzing the samples in T may provide another method for comparing datasets, evaluating similarity, evaluating normality in Y, or generalizing our approach. The marginals from each sample were approximated as univariate normal in T, although there were noted variations. For example, and as noted, about 99% of the total variance came from the first four variables in DS1 (see Table 5). DS2 and DS3 were found to be similar with the first four variables accounting for 90-92% of the total variance. Thus, DS1 is more compressible Figure 5. Univariate Mapping Illustration: this shows the map (left) and inverse map (right) for age (x 9 ) used in both DS1 and DS2. Maps using the sample only (red-dashes) are compared with maps augmented with synthetic data (black-sold).

Figure 6.
Kernel Density Estimation Optimization Illustration: this shows the differential evolution optimization to determine h 9 (for x 9 , age from DS1 and DS2). These show the scatter plots of the Kolmogorov Smirnov test metric (measured D 9 quantities for entire generation) versus the h 9 quantities for two generations: generation = 1 (left); terminal generation = 30 (middle); and closeup view of the terminal generation (right). The terminal generation shows the candidate solutions for h 9 are tightly clustered (compare left pane with middle and right panes). Here, we did not encounter non-normal variables (from the samples) in T that explained a significant portion of the total variance. When a given sample is well approximated as multivariate normal in Y, the PCA transformation will produce univariate normal marginal pdfs in T. This step could be developed into the definitive test for multivariate normality in Y by understanding the residual error of the non-normal marginal pdfs in T. In this work, the analysis in T supports the normality findings for each sample because the residual non-normal errors were parasitic. Future work will investigate: (1) the impact of the residual error in T on normality in Y, and (2) causes for normality in T, i.e., possibly due to forced normality in Y, some characteristic of the X representation data, or the PCA transform. If required, the technique could be generalized to accommodate non-normal marginals in T. As a generalization, uKDE will be investigated for generating univariate non-normal distributions www.nature.com/scientificreports/ in T when called for with the same method used to augment X for the map/inverse constructions. In this sample scenario, the Y description will deviate from Eq. (1). Although, this premise will have to be investigated because the lack of correlation in T only guarantees t j independence when r(t) is multivariate normal. We speculate when the sample has low correlation between most of the bivariate set in X, this approximation may hold.
Choosing the most appropriate space to perform modeling or to analyze the samples deserves consideration. We have used the covariance form suitable for normally distributed variables. In Y, this form is likely appropriate. We used the same form in X as well; this form may not be optimal here because covariance relationships are not preserved over non-linear transformations. It is our contention that Y is best suited for modeling because the www.nature.com/scientificreports/ marginals are normal. It is common practice in univariate/multivariate modeling to adjust variables (univariately) to a standardized normal form or apply transforms to remove skewness. The X to Y map converted each x j to unit variance. When the natural variation for x j is important, the mapping can be modified easily to preserve the variance. If the variable interpretation is not important, modeling can also be performed in T. The method in this report addresses the small sample problem given the sample has the latent normal characteristic or is normal. The approach will require further evaluation on different datasets to understand its general applicability and when the univariate mapping from X to Y approximately produces a multivariate normal. Multiple methods were explored to evaluate multivariate normality. These tests indicated that the samples Figure 9. Marginal Probability Density Functions (pdfs) for DS3 in the T representation: each pdf for DS3 (solid) is compared with its corresponding pdf from synthetic data (dashes).   Table 6. Univariate Kolmogorov Smirnov (KS) tests: in each representation, probability density functions (pdfs) derived from synthetic samples were compared with the respective pdfs derived from the sample for each variable. The KS test was applied 1000 times for each comparison. The number of times the null hypotheses was not rejected for a given test was tabulated as a percentage (all entries are percentages). Due to the experimental arrangements for y j and t j , each test is equivalent to testing for normality as well. Parenthetical entries show the results when not using kernel density estimation to supplement the X-Y map constructions.  www.nature.com/scientificreports/ approximated normality in both the Y and T but also showed some deviation from normality. The interpretation of these findings in the context of data modeling may aid in understanding the limits of both the multivariate SPs and normality approximations in this report's data and beyond. For example, determining the limiting percentage of the random projection tests may be informative in the modeling context. In summary, we offer a definition for an insufficient sample size in the context of synthetic data. When considering a given sample with d attributes and specified covariance structure, a sample size that does not allow reconstructing its population can be considered as insufficient. In future work, we will apply the methods in this report to understand the minimum sample size, relative to d and a given covariance structure, that permits recovering the population. www.nature.com/scientificreports/ Table 8. Multivariate normality tests: for the random projection test, 1000 random projection vectors (1000 tests) were applied to a given sample and each test result was compared against normality; the percentage of times the test was not rejected was tabulated. For synthetic (syn) data, the same 1000 projection vectors were used to test 100 synthetic samples from a given dataset. The percentage of times the test was not rejected was tabulated for each synthetic sample and percentages were averaged over the 100 synthetic samples. Mardia tests were applied to the sample and to a synthetic sample selected at random.

Data availability
The link to publicly available data is provided in text. Mammography summary data can be obtained upon request to the corresponding author: John Heine (john.heine@moffitt.org). Kernel parameters are also available upon request. Using the same vector and limits, individuals were selected from the sample in the same manner. The scatter plot for the sample (n = 24) is shown in the left-pane and from the synthetic population (n = 36,398) in the right-pane. Variables are cited with the names used in their respective resource and with the names used in this report parenthetically.